Implement coalesced pooling over entire batches #368
Merged
Description
This PR ports a feature from
curated-transformers
that applies the pooling operation (which reduces the piece representations to token-level representations) to entire batches instead of individual Docs. This significantly reduces the overhead of launching the custom kernel behind the scenes, especially in high-throughput scenarios such as inference.

This change improves the GPU inference performance of the German transformer model (minus the trainable lemmatizer) by 32.5% (20171.7 WPS -> 26725.9 WPS). GPU training speed also sees a modest improvement of 4.6% (3547.7 WPS -> 3713 WPS).
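The idea can be sketched as follows. This is a minimal NumPy illustration with hypothetical helper names, not the actual curated-transformers implementation (which runs a custom GPU kernel): instead of pooling each Doc's piece representations separately, the pieces of all Docs are concatenated, their token indices are offset, and a single pooling call covers the whole batch.

```python
import numpy as np

def mean_pool(pieces, token_ids, n_tokens):
    """Mean-pool piece vectors into token vectors.

    pieces: (n_pieces, dim) array of piece representations.
    token_ids: (n_pieces,) array mapping each piece to its token index.
    n_tokens: total number of tokens.
    """
    sums = np.zeros((n_tokens, pieces.shape[1]), dtype=pieces.dtype)
    np.add.at(sums, token_ids, pieces)  # scatter-add pieces into tokens
    counts = np.bincount(token_ids, minlength=n_tokens)[:, None]
    return sums / counts

# Each doc is (pieces, token_ids, n_tokens).

def pool_per_doc(docs):
    # One pooling call (and, on GPU, one kernel launch) per Doc.
    return [mean_pool(p, t, n) for p, t, n in docs]

def pool_batched(docs):
    # Concatenate all pieces and offset each Doc's token indices so that
    # a single pooling call covers the entire batch; then split the
    # result back into per-Doc arrays.
    pieces = np.concatenate([p for p, _, _ in docs])
    offsets = np.cumsum([0] + [n for _, _, n in docs[:-1]])
    token_ids = np.concatenate(
        [t + off for (_, t, _), off in zip(docs, offsets)]
    )
    n_total = sum(n for _, _, n in docs)
    pooled = mean_pool(pieces, token_ids, n_total)
    bounds = np.cumsum([n for _, _, n in docs])[:-1]
    return np.split(pooled, bounds)
```

Both functions produce identical token representations; the batched variant simply replaces many small pooling launches with one large one, which is where the throughput gain comes from.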
Types of change
Feature
Checklist